Red Wine Quality Project by Hui Xu

This report explores a dataset containing quality and chemical properties of 1599 red wines.

Univariate Plots Section

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...

The output above shows the structure of the dataset we plan to explore. It has 1599 observations of 13 variables: the index variable X, the dependent variable quality, and 11 chemical properties that may affect the quality of red wine.
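As a rough sketch of how quality ends up as a six-level factor, the snippet below types in the first ten rows of three columns from the `str()` output above, so it runs without the CSV file. The object name `rw_head` is an assumption made for illustration; the full data frame appears as `rw` in the model calls later in the report.

```r
# First ten rows of three columns, copied from the str() output above.
# The quality scores 5, 5, 5, 6, ... correspond to factor codes 3, 3, 3, 4, ...
rw_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Store quality as a factor with the six levels seen in the full dataset (3 to 8).
rw_head$quality <- factor(rw_head$quality, levels = 3:8)

str(rw_head)
table(rw_head$quality)
```

Declaring all six levels up front keeps the empty levels (3, 4 and 8 in this ten-row extract) in tables and plots.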

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol      quality
##  Min.   : 8.40   3: 10  
##  1st Qu.: 9.50   4: 53  
##  Median :10.20   5:681  
##  Mean   :10.42   6:638  
##  3rd Qu.:11.10   7:199  
##  Max.   :14.90   8: 18

The table above gives summary statistics for the dataset: the minimum, maximum, median, mean, first quartile and third quartile of each numeric variable, and the number of wines at each quality level. In the next step, I draw a histogram of each variable using a small helper function.
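The helper function itself is not shown in this rendered report. A minimal sketch of what such a function could look like is given below; the name `plot_hist`, its arguments, and the use of base R graphics (rather than the ggplot2 calls implied by the console messages) are all assumptions.

```r
# Hypothetical helper, not the report's hidden function: bins x with a fixed
# bin width and optionally draws the histogram with base R graphics.
plot_hist <- function(x, binwidth, label, draw = TRUE) {
  # Align the breaks to multiples of binwidth so that they cover the data range.
  lower  <- floor(min(x) / binwidth) * binwidth
  upper  <- max(ceiling(max(x) / binwidth) * binwidth, lower + binwidth)
  breaks <- seq(lower, upper, by = binwidth)
  if (draw) {
    hist(x, breaks = breaks, main = paste("Histogram of", label), xlab = label)
  } else {
    hist(x, breaks = breaks, plot = FALSE)
  }
}

# First ten fixed.acidity values from the str() output above.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
h <- plot_hist(fixed_acidity, binwidth = 0.5, label = "fixed.acidity", draw = FALSE)
print(h$breaks)
print(h$counts)
```

Passing `draw = TRUE` draws the plot instead of only returning the bin counts.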

## Warning: Ignoring unknown parameters: binwidth, bins, pad

The graph above shows that quality is a discrete variable. Most wines are rated 5 or 6 (681 and 638 of the 1599 wines, respectively).

Fixed acidity and volatile acidity both describe the acidity of the red wine, so I put the histograms of these two characteristics together. The graphs are shown as follows:

Both histograms above are skewed to the right, so I also plot these variables on a log scale.

Since fixed acidity and volatile acidity both describe the acidity of red wine, I also create a new variable named total acidity to test the relationship between red wines’ total acidity and quality. The histogram of the total acidity is as follows:

The histogram of total acidity with no data transformation is skewed to the right, with most red wines concentrated in the range of 7 to 8.5 and some outliers exceeding 15. On a log scale, the distribution looks approximately normal.
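A short sketch of the new variable and the log transform, using only the first ten rows printed in the `str()` output. The `skewness` helper is a hypothetical moment-based function written for this example, and the base-10 logarithm is an assumption, since the report does not say which base it uses.

```r
# First ten rows of the two acidity columns, from the str() output above.
fixed_acidity    <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)

# New variable: total acidity is the sum of fixed and volatile acidity.
total_acidity <- fixed_acidity + volatile_acidity

# Log transform, which compresses the long right tail.
log_total_acidity <- log10(total_acidity)

# Hypothetical helper: moment-based sample skewness (positive means right-skewed).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

print(total_acidity)
print(c(raw = skewness(total_acidity), log10 = skewness(log_total_acidity)))
```

Ten rows are far too few to judge the shape of the full distribution; the point is only to show the mechanics of the transformation.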

Even on a log scale, the distribution of residual sugar is skewed to the right. Most wines have residual sugar between about 1.59 and 2.81 (in the original units), with some outliers larger than 8.

## Warning: Removed 1404 rows containing non-finite values (stat_bin).

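The distribution of chlorides is skewed to the right, with some outliers larger than 0.2, so I draw another graph excluding these outliers. Note that the warning above reports 1404 rows removed, so that second graph appears to cover only a small subset of the wines. According to the summary table, the middle half of all wines have chlorides between 0.07 and 0.09, with a median of 0.079.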

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     6.0    22.0    38.0    46.5    62.0   289.0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     7.0    14.0    15.9    21.0    72.0

The distributions of total sulfur dioxide and free sulfur dioxide are both skewed to the right. The middle half of the wines have total sulfur dioxide between 22 and 62 and free sulfur dioxide between 7 and 21. I also plot these values on a log scale, where the distribution of total sulfur dioxide looks approximately normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.74    3.21    3.31    3.31    3.40    4.01
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.990   0.996   0.997   0.997   0.998   1.004

The pH and density of the red wines both appear approximately normally distributed. The middle half of the wines have a pH between 3.21 and 3.40 and a density between 0.9956 and 0.9978.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.330   0.550   0.620   0.658   0.730   2.000

The distribution of sulphates is skewed to the right, with some outliers larger than 1.25. The middle half of the wines have sulphates between 0.55 and 0.73.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.4     9.5    10.2    10.4    11.1    14.9

The distribution of alcohol is skewed to the right. The middle half of the wines have an alcohol content between 9.5 and 11.1, with a median of 10.2 and a mean of 10.42.
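The ranges quoted above are the first and third quartiles, so each one covers the middle half of the wines. As a worked example, the sketch below computes the interquartile range and the upper boxplot fence from quartiles copied out of the summary table. It assumes the common 1.5 × IQR rule (the default in R's `boxplot`), and `upper_fence` is a hypothetical helper name.

```r
# Upper boxplot fence under the 1.5 * IQR rule (assumed convention).
upper_fence <- function(q1, q3, k = 1.5) q3 + k * (q3 - q1)

# First and third quartiles copied from the summary table above.
quartiles <- data.frame(
  variable = c("chlorides", "sulphates", "alcohol"),
  q1 = c(0.07, 0.55, 9.5),
  q3 = c(0.09, 0.73, 11.1)
)

quartiles$iqr         <- quartiles$q3 - quartiles$q1
quartiles$upper_fence <- upper_fence(quartiles$q1, quartiles$q3)
print(quartiles)
```

Under this rule, values above roughly 0.12 for chlorides, 1.0 for sulphates and 13.5 for alcohol would be drawn as outliers in a boxplot.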

Univariate Analysis

Structure of Dataset

This dataset is about the chemical properties and the quality of red wines. It consists of 13 variables and 1599 observations. The 13 variables are the index variable X, the dependent variable quality, and 11 independent variables: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol. These independent variables may influence the quality of the red wine.

Main Feature of Interest

The main feature of interest in the dataset is the quality of the red wine. Quality is scored between 0 (very bad) and 10 (excellent); in this dataset the scores range from 3 to 8, and most red wines have a quality of 5 or 6. In this project, I’d like to determine which features are best for predicting the quality of red wines.

Factors that may Affect Quality of Red Wine

The acidity, residual sugar, chlorides, total sulfur dioxide, density, pH, sulphates and alcohol are likely to contribute to the quality of the red wine.

New Variable Creation

I create a new variable named total acidity from existing variables in the dataset. It equals the sum of fixed acidity and volatile acidity.

Unusual Distribution of Features

The distribution of citric acid appears bimodal, and the distributions of residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide and sulphates are all skewed to the right, so I transform these variables to a log scale.

Bivariate Plots Section

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

I create a correlation matrix to examine the relationship between each pair of variables. Here is what I find:

  1. Volatile acidity and alcohol are most strongly related to quality while citric acid and sulphates are moderately correlated with quality.

  2. Fixed acidity is strongly related to citric acid and density (positively) and to pH (negatively).

  3. Alcohol has a strong negative correlation with density.
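A correlation matrix of this kind can be sketched with base R's `cor()`. The example below uses only the first ten rows printed in the `str()` output, so its numbers will not match the full-data correlations described in the list; `rw_head` is an assumed name, and quality is kept numeric here so that it can enter the matrix.

```r
# First ten rows of selected columns, copied from the str() output above.
rw_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlations between every pair of columns.
cor_matrix <- cor(rw_head)
print(round(cor_matrix, 2))

# Correlations with quality, ranked by absolute size.
quality_cor <- cor_matrix[, "quality"]
quality_cor <- quality_cor[names(quality_cor) != "quality"]
print(sort(abs(quality_cor), decreasing = TRUE))
```

Ranking by absolute value puts strong negative correlations, such as the one reported for volatile acidity, alongside strong positive ones.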

In the next step, I look roughly at how each feature of the wines varies across quality levels.

The median fixed acidity stays around 8 across quality levels. From quality 3 to 7 the medians increase, and wines of quality 7 and 8 both have a median fixed acidity above 8. There might be a weak positive correlation between these two variables.

Volatile acidity is negatively correlated with quality. With the increase of red wines’ quality, the medians of volatile acidities decrease.

The medians of citric acid increase from quality 3 to quality 8, and the upper and lower whiskers generally rise with quality as well. So citric acid could be positively related to the quality of wines.

The residual sugar has almost no effect on the quality of red wine. With quality increasing, the residual sugar is stable.

Wines of quality 5, 6 and 7 have more chloride outliers than wines of the other quality levels. To focus on the majority of wines below the upper fence, I rescale the y-axis.

## Warning: Removed 41 rows containing non-finite values (stat_boxplot).

If we zoom in on the y-axis between 0 and 0.2, we can see that the median chlorides decrease as quality increases. Wines of quality 3 have a median of 0.0905 while wines of quality 8 have a median of 0.0705, which is about 22% lower. The lower whiskers also decrease with increasing quality. Based on the graph, wines of higher quality might have lower chlorides.
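The comparison of medians can be reproduced in two steps: compute the median per quality level, then take the relative change between two groups. The sketch below uses the first ten rows from the `str()` output for the group medians, and the two medians quoted above for the percentage; `relative_decrease` is a hypothetical helper name.

```r
# First ten rows of chlorides and quality, copied from the str() output above.
chlorides <- c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, 0.065, 0.073, 0.071)
quality   <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Median chlorides within each quality level (ten-row extract only).
group_medians <- tapply(chlorides, quality, median)
print(group_medians)

# Relative decrease between two medians.
relative_decrease <- function(from, to) (from - to) / from

# Medians for quality 3 and quality 8 as quoted in the text above.
drop_3_to_8 <- relative_decrease(0.0905, 0.0705)
print(round(100 * drop_3_to_8, 1))  # about 22.1 (percent)
```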

The trends of total sulfur dioxide and free sulfur dioxide follow a very similar pattern: both have their highest medians at quality 5 and decrease toward both ends. This might be due to the correlation between these two variables.

Free sulfur dioxide and total sulfur dioxide show a relatively clear linear pattern when both axes are on a log scale. This correlation may explain the similar distributions in the two boxplots above.

I previously thought that quality was related to density. However, this graph shows that density doesn’t have a strong correlation with quality.
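One likely reason for this pattern is that free sulfur dioxide is a component of total sulfur dioxide, so the two measurements move together. A small sketch of the log-log relationship follows, using only the first ten rows from the `str()` output; the base-10 logarithm is an assumption.

```r
# First ten rows of the two sulfur dioxide columns, from the str() output above.
free_so2  <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Free sulfur dioxide never exceeds the total in these rows.
print(all(free_so2 <= total_so2))

# Correlation and straight-line fit on the log-log scale.
r_log <- cor(log10(free_so2), log10(total_so2))
fit   <- lm(log10(total_so2) ~ log10(free_so2))
print(r_log)
print(coef(fit))
```

With the full 1599 rows the same two calls would quantify the linear pattern seen in the scatter plot.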

In the boxplot above, a trend of decreasing pH values with increasing qualities could be observed. This trend is not only about medians, but also the lower fence and upper fence.

The median of sulphates increases as the quality of the wines increases, and so does the IQR.

Alcohol is positively related to the quality of red wine. Alcohol is also closely related to the density of the wine, since alcohol is less dense than water. Fixed acidity is closely related to density as well: the higher the fixed acidity of a red wine, the higher its density. The graph of fixed acidity and density is shown as follows:

## `geom_smooth()` using method = 'gam'

Density increases with the increase of fixed acidity.

Bivariate Analysis

Relationship between Red Wines’ Quality and other variables

Volatile acidity and alcohol are most strongly related to quality, while citric acid and sulphates are moderately correlated with quality. Volatile acidity is negatively related to the quality of the red wines, while alcohol is positively related to it. Citric acid and sulphates are both positively related to quality, but their correlations with quality are relatively small, less than 0.3.

Relationships between Other Features

Fixed acidity has a high positive correlation with citric acid and density and a high negative correlation with pH. Alcohol has a strong negative correlation with density.

Multivariate Plots Section

First, I create a scatter plot about the relationship between volatile acidity and alcohol, which have relatively high correlations with the quality of red wine.

## Scale for 'colour' is already present. Adding another scale for
## 'colour', which will replace the existing scale.

I find that red wines of better quality tend to have more alcohol and lower volatile acidity. I also create a graph to show alcohol and volatile acidity within each quality category. Most red wines’ quality is 5 or 6.

Here, I classify citric acid based on its median: wines with citric acid above the median are colored blue and wines with citric acid below the median are colored red. From quality 3 to 8, free sulfur dioxide has a positive linear relationship with total sulfur dioxide at every quality level. Wines of higher quality also seem to have more citric acid, since blue points take up a larger share of the plot as quality increases.
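The median split used for the coloring can be sketched as follows, again on the first ten citric acid values from the `str()` output. The group labels are assumptions, as is the treatment of values exactly equal to the median, which the report does not specify.

```r
# First ten citric.acid values, copied from the str() output above.
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Split at the median: strictly above goes to one group, everything else to the other.
citric_median <- median(citric_acid)
citric_group  <- ifelse(citric_acid > citric_median, "above median", "at or below median")

print(citric_median)
print(table(citric_group))
```

The resulting two-level variable can then be mapped to point color in a scatter plot.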

## 
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = rw)
## m2: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity, 
##     data = rw)
## m3: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity + 
##     sulphates, data = rw)
## m4: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity + 
##     sulphates + citric.acid, data = rw)
## m5: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity + 
##     sulphates + citric.acid + fixed.acidity, data = rw)
## 
## ==========================================================================================
##                          m1            m2            m3            m4            m5       
## ------------------------------------------------------------------------------------------
##   (Intercept)          -0.125         1.095***      0.611**       0.646**       0.202     
##                        (0.175)       (0.184)       (0.196)       (0.201)       (0.224)    
##   alcohol               0.361***      0.314***      0.309***      0.309***      0.320***  
##                        (0.017)       (0.016)       (0.016)       (0.016)       (0.016)    
##   volatile.acidity                   -1.384***     -1.221***     -1.265***     -1.343***  
##                                      (0.095)       (0.097)       (0.113)       (0.113)    
##   sulphates                                         0.679***      0.696***      0.701***  
##                                                    (0.101)       (0.103)       (0.103)    
##   citric.acid                                                    -0.079        -0.469***  
##                                                                  (0.104)       (0.137)    
##   fixed.acidity                                                                 0.057***  
##                                                                                (0.013)    
## ------------------------------------------------------------------------------------------
##   R-squared             0.227         0.317         0.336         0.336         0.344     
##   adj. R-squared        0.226         0.316         0.335         0.334         0.342     
##   sigma                 0.710         0.668         0.659         0.659         0.655     
##   F                   468.267       370.379       268.912       201.777       167.023     
##   p                     0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood    -1721.057     -1621.814     -1599.384     -1599.093     -1589.648     
##   Deviance            805.870       711.796       692.105       691.852       683.728     
##   AIC                3448.114      3251.628      3208.768      3210.186      3193.297     
##   BIC                3464.245      3273.136      3235.654      3242.448      3230.937     
##   N                  1599          1599          1599          1599          1599         
## ==========================================================================================
## 
## Calls:
## n1: lm(formula = as.numeric(quality) ~ alcohol, data = rw)
## n2: lm(formula = as.numeric(quality) ~ alcohol + log(volatile.acidity), 
##     data = rw)
## n3: lm(formula = as.numeric(quality) ~ alcohol + log(volatile.acidity) + 
##     log(sulphates), data = rw)
## n4: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity + 
##     sulphates + citric.acid, data = rw)
## n5: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity + 
##     sulphates + citric.acid + fixed.acidity, data = rw)
## 
## ===============================================================================================
##                               n1            n2            n3            n4            n5       
## -----------------------------------------------------------------------------------------------
##   (Intercept)               -0.125        -0.062         0.415*        0.646**       0.202     
##                             (0.175)       (0.165)       (0.171)       (0.201)       (0.224)    
##   alcohol                    0.361***      0.309***      0.299***      0.309***      0.320***  
##                             (0.017)       (0.016)       (0.016)       (0.016)       (0.016)    
##   log(volatile.acidity)                   -0.680***     -0.564***                              
##                                           (0.049)       (0.050)                                
##   log(sulphates)                                         0.659***                              
##                                                         (0.077)                                
##   volatile.acidity                                                    -1.265***     -1.343***  
##                                                                       (0.113)       (0.113)    
##   sulphates                                                            0.696***      0.701***  
##                                                                       (0.103)       (0.103)    
##   citric.acid                                                         -0.079        -0.469***  
##                                                                       (0.104)       (0.137)    
##   fixed.acidity                                                                      0.057***  
##                                                                                     (0.013)    
## -----------------------------------------------------------------------------------------------
##   R-squared                  0.227         0.311         0.341         0.336         0.344     
##   adj. R-squared             0.226         0.310         0.340         0.334         0.342     
##   sigma                      0.710         0.671         0.656         0.659         0.655     
##   F                        468.267       359.989       275.225       201.777       167.023     
##   p                          0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood         -1721.057     -1628.955     -1593.104     -1599.093     -1589.648     
##   Deviance                 805.870       718.183       686.690       691.852       683.728     
##   AIC                     3448.114      3265.910      3196.209      3210.186      3193.297     
##   BIC                     3464.245      3287.419      3223.095      3242.448      3230.937     
##   N                       1599          1599          1599          1599          1599         
## ===============================================================================================

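I fit ten model specifications (m1 to m5 and n1 to n5) to predict the quality of red wines from the variables most correlated with it. Only seven of them are distinct, since n1, n4 and n5 repeat m1, m4 and m5. None of the models fits well: the highest R-squared is 0.344 (m5 and n5), so about two thirds of the variance in quality remains unexplained.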
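Besides R-squared, the tables report AIC and BIC, which penalize extra parameters and so are better suited to comparing models of different size (lower is better). As a check on how these values arise, the sketch below recomputes them from the reported log-likelihoods, counting the slopes, the intercept and the residual variance as parameters. The model selection shown here is just a subset chosen for illustration.

```r
# Number of observations, from the N row of the tables above.
n <- 1599

# Log-likelihoods and predictor counts copied from the tables above.
models <- data.frame(
  model      = c("m1", "m2", "m3", "m5", "n3"),
  loglik     = c(-1721.057, -1621.814, -1599.384, -1589.648, -1593.104),
  predictors = c(1, 2, 3, 5, 3)
)

# Parameters counted by AIC and BIC: slopes + intercept + residual variance.
k <- models$predictors + 2

models$AIC <- -2 * models$loglik + 2 * k
models$BIC <- -2 * models$loglik + log(n) * k
print(models)
```

The recomputed values agree with the tables up to rounding. Among the three-predictor models, n3 (with log-transformed volatile acidity and sulphates) has a lower AIC than m3, and m5 has the lowest AIC overall.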

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Most red wines’ quality is clustered around 5 and 6, where volatile acidity is clustered around 0.4 to 0.8 and alcohol around 9 to 11. Wines of higher quality tend to have more alcohol and citric acid and less volatile acidity.

Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I fit ten model specifications in total (seven of them distinct) to predict quality from several variables that have relatively high correlations with it. The main limitation of my models is that the R-squared is low: even the best model explains only about 34% of the variance in quality, which means more information is needed to predict red wine quality well.
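A further point worth checking when reading the coefficients: the model calls use `as.numeric(quality)`, and the `str()` output shows quality stored as a factor. In R, `as.numeric()` on a factor returns the internal level codes (1 to 6 here) rather than the labels (3 to 8). The sketch below illustrates the difference and compares the m1 prediction at the mean alcohol level with the mean quality computed from the summary counts.

```r
# Quality of the first ten wines as a factor with the six levels 3 to 8.
quality <- factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5), levels = 3:8)

codes  <- as.numeric(quality)                # level codes: 3 3 3 4 3 3 3 5 5 3
scores <- as.numeric(as.character(quality))  # original scores: 5 5 5 6 5 5 5 7 7 5
print(rbind(codes, scores))

# Mean quality on the original scale, from the counts in the summary table.
counts       <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
mean_quality <- sum(as.numeric(names(counts)) * counts) / sum(counts)

# m1 prediction at the mean alcohol level (10.42), using the reported coefficients.
m1_at_mean <- -0.125 + 0.361 * 10.42

print(c(mean_quality = mean_quality, m1_at_mean = m1_at_mean))
```

The gap of about 2 between the two values suggests the models were fitted on codes 1 to 6. If so, the slopes, R-squared, AIC and BIC are unaffected, but the intercepts and fitted values sit two units below the original 3 to 8 scale.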

Final Plots and Summary

Plot One

## `geom_smooth()` using method = 'gam'

Description One

Plot one shows the relationship between fixed acidity and density. As fixed acidity increases, density increases as well.

Plot Two

## Scale for 'colour' is already present. Adding another scale for
## 'colour', which will replace the existing scale.

Description Two

Plot two shows the relationship among volatile acidity, alcohol and quality. Red wines with better quality grade tend to have more alcohol and lower volatile acidity and most red wines’ quality is within the range of 5 and 6.

Plot Three

Description Three

In plot three, I classify citric acid based on its median: wines with citric acid above the median are colored blue and wines with citric acid below the median are colored red. From quality 3 to 8, free sulfur dioxide has a positive linear relationship with total sulfur dioxide at every quality level. Wines of higher quality also seem to have more citric acid, since blue points take up a larger share of the plot as quality increases.

Reflection

This project is about the chemical properties and the quality of red wines. The dataset consists of 13 variables and 1599 observations. The 13 variables are the index variable X, the dependent variable quality, and 11 independent variables: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol. These independent variables may influence the quality of the red wines.

In this project, I first explore each individual variable (excluding the index variable) and examine its distribution. Then I study the relationships between pairs of variables. Finally, I use the variables that have a higher correlation with quality to predict the red wine’s quality. The result is not very good because the R-squared is relatively low. In future work, I would like to find more information that may help predict the quality of red wines, or use more complex methods, such as machine learning, to predict it.